Independent Component Analysis (ICA)#

ICA is a method for blind source separation: when you only observe mixtures, can you recover the original sources?

Think: “unmixing voices at a cocktail party.” Multiple speakers talk at once, each microphone records a different linear mixture, and ICA tries to recover the individual voices.


Learning goals#

  • explain ICA’s core assumptions (independence + non-Gaussianity)

  • distinguish independence vs uncorrelatedness with a counterexample

  • understand why non-Gaussianity is the signal ICA uses (CLT intuition)

  • implement whitening and see its effect on covariance

  • run ICA (FastICA) and compare with PCA visually (Plotly)

Prerequisites#

  • covariance / correlation

  • eigenvalues & eigenvectors (PCA-style intuition)

  • basic probability (Gaussian vs non-Gaussian)


Notation (linear mixing model)#

  • Sources (unknown): \(S \in \mathbb{R}^{n\times k}\)

  • Mixing matrix (unknown): \(A \in \mathbb{R}^{m\times k}\)

  • Observations (what we measure): \(X \in \mathbb{R}^{n\times m}\)

We assume linear instantaneous mixing:

\[ X = SA^T \]

ICA tries to estimate an unmixing matrix \(W \in \mathbb{R}^{k\times m}\) so that:

\[ \hat{S} = XW^T. \]

In practice, \(\hat{S}\) is only identifiable up to permutation (ordering) and scaling/sign.


Table of contents#

  1. Intuition: cocktail party unmixing

  2. Statistical foundation

    • independence vs uncorrelated

    • non-Gaussianity (CLT intuition)

    • kurtosis / (approx) negentropy

  3. Algorithm mechanics

    • whitening

    • maximizing independence (FastICA idea)

  4. Plotly visualizations

    • mixed vs unmixed signals

    • PCA vs ICA comparison

  5. Use cases (signal processing, finance, neuroscience)

  6. Exercises + references

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.decomposition import FastICA, PCA
from sklearn.preprocessing import StandardScaler


pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Intuition — “unmixing voices at a cocktail party”#

Imagine 3 people talking at the same time and 3 microphones in the room.

  • Each microphone records a different mixture of the 3 voices.

  • You only get the microphone recordings \(X\).

  • You want to recover the original voices \(S\) without knowing the room acoustics \(A\).

In a toy setting we can simulate this by:

  1. choosing three independent-looking source signals

  2. mixing them with a random matrix

  3. attempting to recover the sources from the mixtures

# Synthetic sources (toy "voices")
n_samples = 2500
t = np.linspace(0.0, 8.0, n_samples)

s1 = np.sin(2 * np.pi * 1.0 * t)
s2 = np.where(np.sin(2 * np.pi * 3.0 * t) >= 0.0, 1.0, -1.0)  # square-ish
s3 = 2.0 * ((t * 0.7) % 1.0) - 1.0  # sawtooth-ish ramp in [-1, 1)
s3 = s3 + 0.10 * rng.normal(size=n_samples)

S = np.c_[s1, s2, s3]
S = StandardScaler().fit_transform(S)  # each source: zero mean, unit variance

# Random mixing matrix ("room")
A = rng.normal(size=(3, 3))
while abs(np.linalg.det(A)) < 1e-3:
    A = rng.normal(size=(3, 3))

X = S @ A.T
X = StandardScaler().fit_transform(X)  # each observed channel: zero mean, unit variance
fig = make_subplots(
    rows=3,
    cols=2,
    shared_xaxes=True,
    vertical_spacing=0.06,
    column_titles=("Sources (unknown)", "Observed mixtures (microphones)"),
)

for i in range(3):
    fig.add_trace(go.Scatter(x=t, y=S[:, i], mode="lines", line=dict(width=1)), row=i + 1, col=1)
    fig.add_trace(go.Scatter(x=t, y=X[:, i], mode="lines", line=dict(width=1)), row=i + 1, col=2)

for i in range(3):
    fig.update_yaxes(title_text=f"s{i + 1}", row=i + 1, col=1)
    fig.update_yaxes(title_text=f"x{i + 1}", row=i + 1, col=2)

fig.update_xaxes(title_text="time", row=3, col=1)
fig.update_xaxes(title_text="time", row=3, col=2)
fig.update_layout(title="Toy cocktail party: sources vs mixtures", showlegend=False, height=650)
fig.show()

Two important “gotchas” that are features, not bugs:

  • Permutation ambiguity: ICA doesn’t know which source is “voice 1” vs “voice 2”.

  • Scaling/sign ambiguity: multiplying a source by 2 and dividing the corresponding column of \(A\) by 2 leaves \(X\) unchanged. Signs can flip too.

So success looks like recovering the shapes, not a canonical ordering or amplitude. The quick check below confirms the scaling/sign ambiguity numerically.
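
A minimal sanity check of the scaling/sign ambiguity, using the S and A defined above (a sketch; D is an arbitrary diagonal matrix of scales and sign flips we pick for illustration): rescaling the sources while compensating in the mixing matrix leaves the mixtures bit-for-bit identical.

# Scaling/sign ambiguity: rescale the sources, compensate in A; X is unchanged
D = np.diag([2.0, -1.0, 0.5])                 # arbitrary per-source scales/signs
S_alt = S @ D                                 # rescaled (and sign-flipped) sources
A_alt = A @ np.linalg.inv(D)                  # compensating mixing matrix
print(np.allclose(S @ A.T, S_alt @ A_alt.T))  # True: observations are identical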

2) Statistical foundation#

2.1 Independence vs uncorrelated#

  • Uncorrelated means the linear relationship is absent: \(\mathrm{Cov}(X, Y) = 0\).

  • Independent means the entire joint distribution factorizes: \(p(x, y) = p(x)p(y)\).

Independence is strictly stronger. Here’s a classic counterexample: take \(X\) symmetric about 0 (e.g. uniform on \([-1, 1]\)) and \(Y = X^2\). Then \(\mathrm{Cov}(X, Y) = \mathbb{E}[X^3] = 0\), so the two are uncorrelated, yet \(Y\) is a deterministic function of \(X\): clearly not independent.

n = 4000
x = rng.uniform(-1.0, 1.0, size=n)
y = x**2

corr = float(np.corrcoef(x, y)[0, 1])

fig = px.scatter(
    x=x,
    y=y,
    opacity=0.4,
    title=f"Uncorrelated ≠ independent (corr ≈ {corr:.3f})",
    labels={"x": "X", "y": "Y = X²"},
)
fig.show()

ICA wants independent components, not merely uncorrelated ones. PCA only guarantees uncorrelated components, a second-order property; ICA targets the stronger, higher-order notion of independence.

2.2 Non-Gaussianity (why ICA has signal to grab)#

A key intuition comes from the Central Limit Theorem (CLT):

A sum of (many) independent random variables tends to look more Gaussian.

Mixing independent sources produces observations that are sums of sources, so the mixtures often look more Gaussian than the original signals.

ICA works by finding directions (linear combinations) that look maximally non-Gaussian, because those directions are good candidates for the original sources.
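
A quick numeric illustration of the CLT intuition (a sketch using uniform variables): as more independent uniforms are summed, the excess kurtosis of the standardized sum drifts toward the Gaussian value of 0.

# Sums of independent uniforms look increasingly Gaussian (excess kurtosis -> 0)
for n_terms in (1, 2, 4, 16):
    z = rng.uniform(-1.0, 1.0, size=(100_000, n_terms)).sum(axis=1)
    z = (z - z.mean()) / z.std()
    print(f"sum of {n_terms:2d} uniforms: excess kurtosis = {np.mean(z**4) - 3:+.3f}")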

fig = go.Figure()

fig.add_trace(go.Histogram(x=S[:, 1], name="source s2 (square-ish)", nbinsx=60, opacity=0.65))
fig.add_trace(go.Histogram(x=X[:, 0], name="mixture x1", nbinsx=60, opacity=0.65))

fig.update_layout(
    barmode="overlay",
    title="Mixing tends to look 'more Gaussian' (CLT intuition)",
    xaxis_title="value",
    yaxis_title="count",
)
fig.show()

2.3 Kurtosis / (approx) negentropy#

Non-Gaussianity can be quantified in many ways. Two common ideas:

  • Excess kurtosis: for a standardized variable, \(\kappa = \mathbb{E}[Z^4] - 3\). A Gaussian has \(\kappa = 0\).

  • Negentropy: the entropy gap between a Gaussian of the same variance and the distribution itself; it is nonnegative and zero exactly for a Gaussian. (Hard to compute exactly, so ICA typically optimizes proxies.)

FastICA, like many ICA variants, uses a contrast function that increases as a projected signal deviates from Gaussian.

def standardize_1d(x: np.ndarray) -> np.ndarray:
    x = np.asarray(x)
    x = x - x.mean()
    return x / (x.std() + 1e-12)


def excess_kurtosis(x: np.ndarray) -> float:
    z = standardize_1d(x)
    return float(np.mean(z**4) - 3.0)


def negentropy_proxy(x: np.ndarray, n_ref: int = 20000, seed: int = 0) -> float:
    """A simple negentropy proxy using the log-cosh contrast.

    This is *not* an exact negentropy computation; it’s an intuitive scalar that
    increases when a distribution deviates from Gaussian.
    """
    z = standardize_1d(x)
    rng_local = np.random.default_rng(seed)
    v = rng_local.standard_normal(size=n_ref)

    def G(u):
        u = np.clip(u, -10.0, 10.0)
        return np.log(np.cosh(u))

    return float((np.mean(G(z)) - np.mean(G(v))) ** 2)


labels = ["1", "2", "3"]

kurt_S = [excess_kurtosis(S[:, i]) for i in range(3)]
kurt_X = [excess_kurtosis(X[:, i]) for i in range(3)]

neg_S = [negentropy_proxy(S[:, i], seed=10 + i) for i in range(3)]
neg_X = [negentropy_proxy(X[:, i], seed=20 + i) for i in range(3)]

fig = make_subplots(rows=1, cols=2, subplot_titles=("Excess kurtosis", "Negentropy proxy (log-cosh)"))

fig.add_trace(go.Bar(x=labels, y=kurt_S, name="sources"), row=1, col=1)
fig.add_trace(go.Bar(x=labels, y=kurt_X, name="mixtures"), row=1, col=1)

fig.add_trace(go.Bar(x=labels, y=neg_S, name="sources"), row=1, col=2)
fig.add_trace(go.Bar(x=labels, y=neg_X, name="mixtures"), row=1, col=2)

fig.update_layout(barmode="group", title="Non-Gaussianity proxies (toy example)")
fig.update_xaxes(title_text="component", row=1, col=1)
fig.update_xaxes(title_text="component", row=1, col=2)
fig.show()

3) Algorithm mechanics#

Most ICA algorithms have the same shape:

  1. center: subtract the mean

  2. whiten: remove second-order correlations (Cov becomes identity)

  3. rotate: find directions that maximize independence (usually via non-Gaussianity)

3.1 Whitening#

Whitening transforms the observations so their covariance matrix becomes approximately the identity.

Why it helps:

  • It removes all second-order structure (correlations).

  • After whitening, the remaining mixing is (roughly) a rotation.

  • ICA can then focus on higher-order structure to find independent components.

def whiten(X: np.ndarray, eps: float = 1e-12):
    """Center + whiten so that Cov(X_white) ≈ I."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = (Xc.T @ Xc) / Xc.shape[0]

    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals + eps))
    W = eigvecs @ D_inv_sqrt @ eigvecs.T
    X_white = Xc @ W.T
    return X_white, W


X_white, W_white = whiten(X)

cov_X = (X.T @ X) / X.shape[0]
cov_Xw = (X_white.T @ X_white) / X_white.shape[0]

fig = make_subplots(rows=1, cols=2, subplot_titles=("Cov(X) (mixtures)", "Cov(whitened X) (≈ I)"))

fig.add_trace(go.Heatmap(z=cov_X, zmin=-1, zmax=1, colorscale="RdBu"), row=1, col=1)
fig.add_trace(go.Heatmap(z=cov_Xw, zmin=-1, zmax=1, colorscale="RdBu"), row=1, col=2)

fig.update_layout(title="Whitening removes second-order correlations")
fig.show()

3.2 Maximizing independence (FastICA idea)#

After whitening, ICA searches for weight vectors \(w\) so that the projection \(w^T x\) is maximally non-Gaussian.

A common fixed-point update (one component) looks like:

\[ w \leftarrow \mathbb{E}[x\,g(w^T x)] - \mathbb{E}[g'(w^T x)]\,w \]

followed by normalization and orthogonalization against previously found components.

The nonlinearity \(g\) (e.g. tanh) is chosen so the update increases a non-Gaussianity contrast (a proxy for independence).
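
Below is a minimal from-scratch sketch of that deflation-style iteration with \(g = \tanh\), run on the whitened X_white from section 3.1. It is for intuition only (fastica_sketch is our own toy helper, not a library API); use scikit-learn’s FastICA in practice. The cell after it defines a correlation-based helper to match estimated components to the true sources (possible only in toy data) and fits PCA and FastICA for the comparisons in section 4.

def fastica_sketch(Xw: np.ndarray, n_components: int, max_iter: int = 200, tol: float = 1e-6, seed: int = 0) -> np.ndarray:
    """Toy deflation FastICA with g = tanh. Expects whitened input (Cov ≈ I)."""
    rng_local = np.random.default_rng(seed)
    n, m = Xw.shape
    W = np.zeros((n_components, m))
    for p in range(n_components):
        w = rng_local.normal(size=m)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            u = Xw @ w                                   # projections w^T x
            g = np.tanh(u)                               # nonlinearity g
            g_prime = 1.0 - g**2                         # g'(u) for tanh
            w_new = (Xw.T @ g) / n - g_prime.mean() * w  # fixed-point update
            w_new -= W[:p].T @ (W[:p] @ w_new)           # orthogonalize vs found components
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1.0) < tol  # direction stable (up to sign)
            w = w_new
            if converged:
                break
        W[p] = w
    return W


W_sketch = fastica_sketch(X_white, 3)
S_sketch = X_white @ W_sketch.T  # estimated sources, up to permutation/sign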

from itertools import permutations


def corr_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Signed correlation between columns of A and B."""
    A0 = A - A.mean(axis=0, keepdims=True)
    B0 = B - B.mean(axis=0, keepdims=True)
    A0 = A0 / (A0.std(axis=0, keepdims=True) + 1e-12)
    B0 = B0 / (B0.std(axis=0, keepdims=True) + 1e-12)
    return (A0.T @ B0) / A0.shape[0]


def match_by_correlation(S_true: np.ndarray, S_est: np.ndarray):
    """Match estimated components to true sources (perm + sign).

    This is only for visualization in toy data where we *know* the sources.
    """
    C = corr_matrix(S_true, S_est)
    k = C.shape[0]
    best_perm = None
    best_score = -np.inf
    for perm in permutations(range(k)):
        score = sum(abs(C[i, perm[i]]) for i in range(k))
        if score > best_score:
            best_score = score
            best_perm = perm

    perm = np.array(best_perm)
    signs = np.sign([C[i, perm[i]] for i in range(k)]).astype(float)
    signs = np.where(signs == 0.0, 1.0, signs)

    S_matched = S_est[:, perm] * signs
    return S_matched, perm, signs, C


pca = PCA(n_components=3, random_state=42)
S_pca = pca.fit_transform(X)
S_pca = StandardScaler().fit_transform(S_pca)

ica = FastICA(n_components=3, random_state=42, whiten="unit-variance", max_iter=2000)
S_ica = ica.fit_transform(X)
S_ica = StandardScaler().fit_transform(S_ica)

S_pca_matched, perm_pca, signs_pca, C_pca_raw = match_by_correlation(S, S_pca)
S_ica_matched, perm_ica, signs_ica, C_ica_raw = match_by_correlation(S, S_ica)

abs_corr_pca = np.abs(corr_matrix(S, S_pca_matched))
abs_corr_ica = np.abs(corr_matrix(S, S_ica_matched))

3.3 Practical ICA with scikit-learn#

sklearn.decomposition.FastICA is a solid default implementation of the FastICA family.

Data layout (common gotcha)

  • rows = samples (here: time points)

  • columns = observed mixtures (here: microphone/sensor channels)

Key parameters

  • n_components: how many independent components to extract

  • whiten: 'unit-variance' (components scaled to unit variance) vs 'arbitrary-variance' vs False

  • algorithm: 'parallel' (estimate all components together) vs 'deflation' (one-by-one)

  • fun: contrast / nonlinearity ('logcosh' default, 'exp', 'cube')

  • max_iter, tol: convergence control

What you get back

  • fit_transform(X) returns \(\hat{S}\) (up to permutation/sign/scale)

  • mixing_ and components_ are estimated mixing/unmixing matrices (same ambiguities)

In the toy problem we can match components to sources using correlation only because we know the ground truth. The cell below adds a few quick shape and round-trip checks.
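
A few sanity checks on the fitted estimator from the earlier cell; inspecting shapes and a round-trip reconstruction is a cheap way to confirm the layout conventions (exact recovery of X holds here because the mixing is square and noiseless).

# Inspect the fitted FastICA object
print("components_ (unmixing) shape:", ica.components_.shape)  # (n_components, n_features)
print("mixing_ shape:", ica.mixing_.shape)                     # (n_features, n_components)

# Round trip: estimated sources -> mixtures should reconstruct X
X_rec = ica.inverse_transform(ica.transform(X))
print("max reconstruction error:", np.abs(X - X_rec).max())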

4) Plotly visualizations#

4.1 Mixed vs unmixed signals#

In real life you don’t have access to the true sources — but in a toy dataset we can show the idea clearly.

fig = make_subplots(
    rows=3,
    cols=3,
    shared_xaxes=True,
    vertical_spacing=0.06,
    column_titles=("Sources (truth)", "Mixtures (observed)", "ICA (unmixed)"),
)

for i in range(3):
    fig.add_trace(go.Scatter(x=t, y=S[:, i], mode="lines", line=dict(width=1)), row=i + 1, col=1)
    fig.add_trace(go.Scatter(x=t, y=X[:, i], mode="lines", line=dict(width=1)), row=i + 1, col=2)
    fig.add_trace(
        go.Scatter(x=t, y=S_ica_matched[:, i], mode="lines", line=dict(width=1)),
        row=i + 1,
        col=3,
    )

for i in range(3):
    fig.update_yaxes(title_text=f"{i + 1}", row=i + 1, col=1)
    fig.update_yaxes(title_text=f"{i + 1}", row=i + 1, col=2)
    fig.update_yaxes(title_text=f"{i + 1}", row=i + 1, col=3)

fig.update_xaxes(title_text="time", row=3, col=1)
fig.update_xaxes(title_text="time", row=3, col=2)
fig.update_xaxes(title_text="time", row=3, col=3)
fig.update_layout(title="Mixed vs unmixed signals (toy example)", showlegend=False, height=650)
fig.show()

4.2 PCA vs ICA comparison#

PCA and ICA both produce linear components, but they optimize different objectives:

  • PCA: directions of maximum variance (components are orthogonal and uncorrelated).

  • ICA: directions of maximum independence (using non-Gaussianity).

In this toy problem, ICA aligns components with sources much better than PCA, as both the numeric summary and the heatmaps below show.
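
Before the heatmaps, a compact numeric summary: after permutation/sign matching, the diagonal of each absolute-correlation matrix gives the agreement between each true source and its best-matched component (1.0 means perfect recovery).

# Per-source alignment after matching (values near 1.0 = recovered)
print("PCA |corr| per source:", np.diag(abs_corr_pca).round(3))
print("ICA |corr| per source:", np.diag(abs_corr_ica).round(3))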

labels_sources = ["s1", "s2", "s3"]
labels_components = ["c1", "c2", "c3"]

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("PCA: |corr(component, source)|", "ICA: |corr(component, source)|"),
)

fig.add_trace(
    go.Heatmap(
        z=abs_corr_pca.T,
        x=labels_sources,
        y=labels_components,
        zmin=0,
        zmax=1,
        colorscale="Blues",
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Heatmap(
        z=abs_corr_ica.T,
        x=labels_sources,
        y=labels_components,
        zmin=0,
        zmax=1,
        colorscale="Blues",
    ),
    row=1,
    col=2,
)

fig.update_layout(title="PCA vs ICA on the same mixtures", height=420)
fig.show()

5) Use cases#

ICA is most useful when “mixtures of independent-ish things” is a good model.

Signal processing#

  • Audio source separation (cocktail party) when mixing is approximately linear.

  • Denoising / artifact removal: separate structured signal from independent noise-like components.

Finance#

  • Discover latent factors behind correlated asset returns (market/sector/style components).

  • Separate independent drivers for risk modeling or portfolio construction.

Neuroscience#

  • EEG/MEG: remove artifacts (eye blinks, muscle activity) by identifying independent sources.

  • fMRI: identify spatially independent activation patterns (often with additional constraints).

Practical pitfalls#

  • ICA is sensitive to preprocessing (centering, whitening, scaling).

  • Components are only defined up to permutation/sign/scale.

  • If the independence/non-Gaussian assumptions are wrong, ICA can return unstable or uninterpretable components.

6) Exercises + references#

Exercises#

  1. Make one source Gaussian, then two, and observe when ICA becomes ambiguous (a starter sketch follows this list).

  2. Increase the number of sources/microphones and check when recovery degrades.

  3. Compare kurtosis/negentropy of PCA vs ICA components on the same mixtures.
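
A starter sketch for exercise 1, reusing S, A, n_samples, and corr_matrix from above. ICA tolerates at most one Gaussian source, so with one Gaussian separation usually still works; uncomment the second replacement to watch identifiability break with two.

# Exercise 1 starter: at most one Gaussian source is allowed
S_g = S.copy()
S_g[:, 0] = rng.standard_normal(n_samples)    # one Gaussian source: usually still OK
# S_g[:, 1] = rng.standard_normal(n_samples)  # two Gaussians: rotation ambiguity
X_g = StandardScaler().fit_transform(S_g @ A.T)
S_g_hat = FastICA(n_components=3, random_state=0, whiten="unit-variance", max_iter=2000).fit_transform(X_g)
print(np.abs(corr_matrix(S_g, S_g_hat)).round(2))  # rows: true sources, cols: estimates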

References#

  • Hyvärinen & Oja (2000): Independent Component Analysis: Algorithms and Applications

  • scikit-learn docs: sklearn.decomposition.FastICA

  • Cardoso (1998): Blind signal separation: statistical principles